Search CORE

68 research outputs found

Sorting suffixes of a text via its Lyndon Factorization

Author: Mantaci Sabrina
Restivo Antonio
Rosone Giovanna
Sciortino Marinella
Publication venue
Publication date: 01/01/2013
Field of study

The process of sorting the suffixes of a text plays a fundamental role in Text Algorithms. They are used for instance in the constructions of the Burrows-Wheeler transform and the suffix array, widely used in several fields of Computer Science. For this reason, several recent researches have been devoted to finding new strategies to obtain effective methods for such a sorting. In this paper we introduce a new methodology in which an important role is played by the Lyndon factorization, so that the local suffixes inside factors detected by this factorization keep their mutual order when extended to the suffixes of the whole word. This property suggests a versatile technique that easily can be adapted to different implementative scenarios.Comment: Submitted to the Prague Stringology Conference 2013 (PSC 2013

arXiv.org e-Print Archive

Archivio istituzionale della ricerca - Università di Palermo

Lightweight LCP Construction for Very Large Collections of Strings

Author: Cox Anthony J.
Garofalo Fabio
Rosone Giovanna
Sciortino Marinella
Publication venue: 'Elsevier BV'
Publication date: 01/01/2016
Field of study

The longest common prefix array is a very advantageous data structure that, combined with the suffix array and the Burrows-Wheeler transform, allows to efficiently compute some combinatorial properties of a string useful in several applications, especially in biological contexts. Nowadays, the input data for many problems are big collections of strings, for instance the data coming from "next-generation" DNA sequencing (NGS) technologies. In this paper we present the first lightweight algorithm (called extLCP) for the simultaneous computation of the longest common prefix array and the Burrows-Wheeler transform of a very large collection of strings having any length. The computation is realized by performing disk data accesses only via sequential scans, and the total disk space usage never needs more than twice the output size, excluding the disk space required for the input. Moreover, extLCP allows to compute also the suffix array of the strings of the collection, without any other further data structure is needed. Finally, we test our algorithm on real data and compare our results with another tool capable to work in external memory on large collections of strings.Comment: This manuscript version is made available under the CC-BY-NC-ND 4.0 license http://creativecommons.org/licenses/by-nc-nd/4.0/ The final version of this manuscript is in press in Journal of Discrete Algorithm

arXiv.org e-Print Archive

Archivio della Ricerca - Università di Pisa

Archivio istituzionale della ricerca - Università di Palermo

Cyclic Complexity of Words

Author: Cassaigne Julien
Fici Gabriele
Sciortino Marinella
Zamboni Luca Q.
Publication venue
Publication date: 28/06/2016
Field of study

We introduce and study a complexity function on words

c_x(n),

called \emph{cyclic complexity}, which counts the number of conjugacy classes of factors of length

n

of an infinite word

x.

We extend the well-known Morse-Hedlund theorem to the setting of cyclic complexity by showing that a word is ultimately periodic if and only if it has bounded cyclic complexity. Unlike most complexity functions, cyclic complexity distinguishes between Sturmian words of different slopes. We prove that if

x

is a Sturmian word and

y

is a word having the same cyclic complexity of

x,

then up to renaming letters,

x

and

y

have the same set of factors. In particular,

y

is also Sturmian of slope equal to that of

x.

Since

c_x(n)=1

for some

n\geq 1

implies

x

is periodic, it is natural to consider the quantity

\liminf_{n\rightarrow \infty} c_x(n).

We show that if

x

is a Sturmian word, then

\liminf_{n\rightarrow \infty} c_x(n)=2.

We prove however that this is not a characterization of Sturmian words by exhibiting a restricted class of Toeplitz words, including the period-doubling word, which also verify this same condition on the limit infimum. In contrast we show that, for the Thue-Morse word

t

\liminf_{n\rightarrow \infty} c_t(n)=+\infty.

Comment: To appear in Journal of Combinatorial Theory, Series

arXiv.org e-Print Archive

HAL-UJM

HAL AMU

Hal-Diderot

HAL-Ecole des Ponts ParisTech

Archivio istituzionale della ricerca - Università di Palermo

HAL - UPEC / UPEM

Universal Lyndon Words

Author: Carpi Arturo
Fici Gabriele
Holub Stepan
Oprsal Jakub
Sciortino Marinella
Publication venue
Publication date: 01/01/2014
Field of study

A word

w

over an alphabet

\Sigma

is a Lyndon word if there exists an order defined on

\Sigma

for which

w

is lexicographically smaller than all of its conjugates (other than itself). We introduce and study \emph{universal Lyndon words}, which are words over an

n

-letter alphabet that have length

n!

and such that all the conjugates are Lyndon words. We show that universal Lyndon words exist for every

n

and exhibit combinatorial and structural properties of these words. We then define particular prefix codes, which we call Hamiltonian lex-codes, and show that every Hamiltonian lex-code is in bijection with the set of the shortest unrepeated prefixes of the conjugates of a universal Lyndon word. This allows us to give an algorithm for constructing all the universal Lyndon words.Comment: To appear in the proceedings of MFCS 201

arXiv.org e-Print Archive

Crossref

University of Birmingham Research Portal

Archivio istituzionale della ricerca - Università di Palermo

On the Impact of Morphisms on BWT-Runs

Author: Fici Gabriele
Romana Giuseppe
Sciortino Marinella
Urbina Cristian
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 34th Annual Symposium on Combinatorial Pattern Matching (CPM 2023)
Publication date: 01/01/2023
Field of study

Morphisms are widely studied combinatorial objects that can be used for generating infinite families of words. In the context of Information theory, injective morphisms are called (variable length) codes. In Data compression, the morphisms, combined with parsing techniques, have been recently used to define new mechanisms to generate repetitive words. Here, we show that the repetitiveness induced by applying a morphism to a word can be captured by a compression scheme based on the Burrows-Wheeler Transform (BWT). In fact, we prove that, differently from other compression-based repetitiveness measures, the measure r_bwt (which counts the number of equal-letter runs produced by applying BWT to a word) strongly depends on the applied morphism. More in detail, we characterize the binary morphisms that preserve the value of r_bwt(w), when applied to any binary word w containing both letters. They are precisely the Sturmian morphisms, which are well-known objects in Combinatorics on words. Moreover, we prove that it is always possible to find a binary morphism that, when applied to any binary word containing both letters, increases the number of BWT-equal letter runs by a given (even) number. In addition, we derive a method for constructing arbitrarily large families of binary words on which BWT produces a given (even) number of new equal-letter runs. Such results are obtained by using a new class of morphisms that we call Thue-Morse-like. Finally, we show that there exist binary morphisms ? for which it is possible to find words w such that the difference r_bwt(?(w))-r_bwt(w) is arbitrarily large

Dagstuhl Research Online Publication Server

Detecting Mutations by eBWT

Author: Pisanti Nadia
PREZZA NICOLA
Rosone Giovanna
Sciortino Marinella
Publication venue: place:Leibniz
Publication date: 01/01/2018
Field of study

In this paper we develop a theory describing how the extended Burrows-Wheeler Transform (EBWT) of a collection of DNA fragments tends to cluster together the copies of nucleotides sequenced from a genome G. Our theory accurately predicts how many copies of any nucleotide are expected inside each such cluster, and how an elegant and precise LCP array based procedure can locate these clusters in the EBWT. Our findings are very general and can be applied to a wide range of different problems. In this paper, we consider the case of alignment-free and reference-free SNPs discovery in multiple collections of reads. We note that, in accordance with our theoretical results, SNPs are clustered in the EBWT of the reads collection, and we develop a tool finding SNPs with a simple scan of the EBWT and LCP arrays. Preliminary results show that our method requires much less coverage than state-of-the-art tools while drastically improving precision and sensitivity

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

Archivio della Ricerca - Università di Pisa

Dagstuhl Research Online Publication Server

Archivio istituzionale della ricerca - Università degli Studi di Venezia Ca' Foscari

Archivio della ricerca- LUISS Libera Università Internazionale degli Studi Sociali Guido Carli di Roma

Archivio istituzionale della ricerca - Università di Palermo

Lightweight Reference-Free Variation Detection using the Burrows-Wheeler Transform

Author: Giovanna Rosone
Marinella Sciortino
Nadia Pisanti
Nicola Prezza
Publication venue
Publication date: 01/01/2019
Field of study

Lightweight Reference-Free Variation Detection using the Burrows-Wheeler Transfor

Archivio della Ricerca - Università di Pisa

A New Class of Searchable and Provably Highly Compressible String Transformations

Author: Giancarlo Raffaele
Manzini Giovanni
Rosone Giovanna
Sciortino Marinella
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 30th Annual Symposium on Combinatorial Pattern Matching (CPM 2019)
Publication date: 01/01/2019
Field of study

The Burrows-Wheeler Transform is a string transformation that plays a fundamental role for the design of self-indexing compressed data structures. Over the years, researchers have successfully extended this transformation outside the domains of strings. However, efforts to find non-trivial alternatives of the original, now 25 years old, Burrows-Wheeler string transformation have met limited success. In this paper we bring new lymph to this area by introducing a whole new family of transformations that have all the "myriad virtues" of the BWT: they can be computed and inverted in linear time, they produce provably highly compressible strings, and they support linear time pattern search directly on the transformed string. This new family is a special case of a more general class of transformations based on context adaptive alphabet orderings, a concept introduced here. This more general class includes also the Alternating BWT, another invertible string transforms recently introduced in connection with a generalization of Lyndon words

arXiv.org e-Print Archive

Archivio della Ricerca - Università di Pisa

Dagstuhl Research Online Publication Server

Archivio Istituzionale della Ricerca- Università del Piemonte Orientale

Archivio istituzionale della ricerca - Università di Palermo

Computing the original eBWT faster, simpler, and with less memory

Author: Boucher Christina
Cenzato Davide
Lipták Zsuzsanna
Rossi Massimiliano
Sciortino Marinella
Publication venue
Publication date: 01/01/2021
Field of study

Mantaci et al. [TCS 2007] defined the eBWT to extend the definition of the BWT to a collection of strings, however, since this introduction, it has been used more generally to describe any BWT of a collection of strings and the fundamental property of the original definition (i.e., the independence from the input order) is frequently disregarded. In this paper, we propose a simple linear-time algorithm for the construction of the original eBWT, which does not require the preprocessing of Bannai et al. [CPM 2021]. As a byproduct, we obtain the first linear-time algorithm for computing the BWT of a single string that uses neither an end-of-string symbol nor Lyndon rotations. We combine our new eBWT construction with a variation of prefix-free parsing to allow for scalable construction of the eBWT. We evaluate our algorithm (pfpebwt) on sets of human chromosomes 19, Salmonella, and SARS-CoV2 genomes, and demonstrate that it is the fastest method for all collections, with a maximum speedup of 7.6x on the second best method. The peak memory is at most 2x larger than the second best method. Comparing with methods that are also, as our algorithm, able to report suffix array samples, we obtain a 57.1x improvement in peak memory. The source code is publicly available at https://github.com/davidecenzato/PFP-eBWT.Comment: 20 pages, 5 figures, 1 tabl

arXiv.org e-Print Archive

Catalogo dei prodotti della ricerca